1.  Introduction

Plants and animals frequently appear in consumer images but are often
incidental background objects whose specific fine-grained details cannot be
seen. For instance, consider the photo on the left side of Figure 1: what
species of tree is highlighted in red? The answer to this question could
provide useful information about the photo for a range of applications. Photo
organization software could automatically tag images with species names of
flora or fauna to support content-based retrieval [10]. Detecting and
identifying species could help to infer a geo-tag for an image [4], especially
for rural photos that lack other geo-informative evidence, since many species
of plants and animals occur only in certain regions of the world [3]. On the
other hand, when images are already geo-tagged, recognizing species could
support citizen science applications that use consumer photos to track the
distribution of natural phenomena [8].

But flora identification is a very difficult problem, both for computers and
for humans who are not domain experts. (Did you correctly identify the tree in
Figure 1 as a Chilean Wine Palm, or Jubaea chilensis, which is endemic to
central Chile?) While recent work has considered automated techniques for
fine-grained classification, including classifying among species of birds [9]
and leaves [6], these papers typically study images in which the objects of
interest are large and have distinctive local features (like shapes of
individual leaves) that are readily visible. Other recent work has built hybrid
human-computer recognition systems, using mid-level visual attributes (image
features that are both visually distinctive and semantically meaningful) as the
“language” to allow humans and computer vision algorithms to collaborate on
recognition tasks [1,2]. These techniques work well in domains where clean
common-language visual attributes exist, as in bird recognition with attributes
like “yellow beak” and “white belly.” But these techniques are hard to apply
with non-expert users who lack the vocabulary for describing properties of
objects, especially when individual properties of the object are not visible
and recognition must rely on the overall “look” of the object (as in Figure 1).
This challenge is compounded by the fact that specimens of the same species
differ widely in appearance, e.g. in plants due to factors such as age,
climate, disease, and pruning.

Figure 1: Diagram of our proposed human-in-the-loop system. We take a
hand-marked target tree, extract features and find visually similar trees in a
library of annotated images, ask the user for feedback on which of these
candidates are similar, update the distance metric, find new candidates, and
iterate until convergence to a tree label.

In this ongoing work, we are developing a method that involves a user in the
loop to aid in the fine-grained recognition of a diverse set of tree species.
Instead of asking users to provide attributes of trees, we ask them to judge
the similarity between pairs of tree images, and then use these judgments to
learn the parameters of a discriminative distance metric for use with k-nearest
neighbors. Over time, the discriminative distance function becomes a better
approximation to the human’s judgment of visual similarity. We present
baselines and results of our human-guided approach on a collection of 20 tree
species from five geographic locations.

2.  Methodology 

Our approach has an offline and an online training phase. We assume that we
have a labeled training set of images that are cropped tightly around single
tree exemplars. In the offline phase, we extract global features from each
cropped training image and use the known labels in the training set to learn
the parameters of a distance metric. We use the regularized online distance
metric learning algorithm of Jin et al. [5] for both offline and online
learning. Given a pair of exemplars with known class labels, the approach
minimizes a regularized loss function based on the squared Mahalanobis distance
as a function of the covariance matrix. We adapt this approach to a batch
method by using it as a subroutine in a pocket-algorithm fashion. We iterate
over all pairs in the training set and evaluate the total error. As in the
pocket algorithm, we retain the best solution so far until many iterations
without improvement are observed.

Figure 2: Human interaction GUI. Given a target image (top), the user is shown
candidate matches under the current distance metric (bottom), and is asked to
indicate which images appear to be true matches.
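As a rough illustration of the offline batch procedure, the sketch below wraps a pairwise update in a pocket-style loop. It is not the update rule of [5]: it assumes a diagonal metric (one non-negative weight per feature), uses a simple hinge-style pairwise loss, and omits the regularizer. All names and parameters (`sq_dist`, `pair_update`, `pocket_train`, `margin`, `lr`, `patience`) are hypothetical.

```python
import itertools
import random

def sq_dist(w, x, y):
    """Squared distance under a diagonal metric with weights w (an assumption)."""
    return sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y))

def pair_loss(w, x, y, same, margin=1.0):
    """Hinge-style loss: same-class pairs should be closer than the margin,
    different-class pairs farther than the margin."""
    d = sq_dist(w, x, y)
    return max(0.0, d - margin) if same else max(0.0, margin - d)

def pair_update(w, x, y, same, lr=0.1, margin=1.0):
    """One gradient step on a violated pair; weights are clipped at zero
    so the diagonal metric stays valid."""
    if pair_loss(w, x, y, same, margin) == 0.0:
        return list(w)
    sign = 1.0 if same else -1.0
    return [max(0.0, wi - lr * sign * (xi - yi) ** 2)
            for wi, xi, yi in zip(w, x, y)]

def total_error(w, data, margin=1.0):
    """Sum of pairwise losses over all pairs of labeled exemplars."""
    return sum(pair_loss(w, xa, xb, la == lb, margin)
               for (xa, la), (xb, lb) in itertools.combinations(data, 2))

def pocket_train(data, dim, patience=5, max_epochs=100, seed=0):
    """Sweep over all pairs, keeping the best weights seen so far (the
    'pocket') and stopping after `patience` sweeps without improvement."""
    rng = random.Random(seed)
    pairs = list(itertools.combinations(data, 2))
    w = [1.0] * dim  # start from the Euclidean metric
    best_w, best_err, stale = list(w), total_error(w, data), 0
    for _ in range(max_epochs):
        rng.shuffle(pairs)
        for (xa, la), (xb, lb) in pairs:
            w = pair_update(w, xa, xb, la == lb)
        err = total_error(w, data)
        if err < best_err:
            best_w, best_err, stale = list(w), err, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_w, best_err
```

Here `data` is a list of `(feature_vector, label)` tuples. Because the pocket is initialized with the starting weights, the returned error can never exceed the error of the initial Euclidean metric.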

In the online phase, human interaction is used to improve the distance metric
and recognition results. The human user selects an image region corresponding
to an unknown tree of interest. We compute global features like GIST [7] from
that region, and find the k most similar images in our training set to present
to the user. The user then indicates whether each of these tree images appears
to be of the same species as the target image (Figure 2). Using these positive
(objects are similar) and negative (objects are dissimilar) responses, the
algorithm updates the distance metric using [5] and presents the user with the
new k-nearest neighbors under this updated distance metric. This cycle repeats
until the user thinks most candidates are similar to the target image, in which
case the system suggests the majority label.
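The interaction cycle described above can be sketched as follows. This is only an illustration under simplifying assumptions: features are taken as precomputed vectors, the metric is diagonal, the update is a plain gradient-style step rather than the algorithm of [5], and "most candidates" is read as more than half. The names `k_nearest`, `feedback_update`, `identify`, and the `ask_user` callback are hypothetical.

```python
from collections import Counter

def sq_dist(w, x, y):
    """Squared distance under a diagonal metric with weights w (an assumption)."""
    return sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y))

def k_nearest(w, target, library, k):
    """Return the k (features, label) entries closest to the target."""
    return sorted(library, key=lambda item: sq_dist(w, target, item[0]))[:k]

def feedback_update(w, target, candidates, answers, lr=0.1):
    """Pull approved candidates closer and push rejected ones away,
    clipping weights at zero so the metric stays valid."""
    for (feat, _), is_match in zip(candidates, answers):
        sign = 1.0 if is_match else -1.0
        w = [max(0.0, wi - lr * sign * (t - f) ** 2)
             for wi, t, f in zip(w, target, feat)]
    return w

def identify(target, library, ask_user, k=3, max_rounds=10):
    """Iterate retrieval and feedback until the user accepts most candidates,
    then suggest the majority label among the current candidates."""
    w = [1.0] * len(target)  # start from the Euclidean metric
    candidates = k_nearest(w, target, library, k)
    for _ in range(max_rounds):
        answers = ask_user(candidates)  # one True/False per candidate
        if sum(answers) > len(candidates) / 2:
            break
        w = feedback_update(w, target, candidates, answers)
        candidates = k_nearest(w, target, library, k)
    labels = [label for _, label in candidates]
    return Counter(labels).most_common(1)[0][0]
```

In an interactive setting, `ask_user` would display the candidates in a GUI like the one in Figure 2 and return the user's judgments. For testing, it can be replaced by a simulated user.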

3.  Evaluation 

For a preliminary evaluation, we chose four indigenous tree species from each
of a diverse set of five countries (Philippines, Chile, Jordan, India, and
Taiwan) to create a 20-way classification problem. We collected a dataset of
269 images (from Flickr and the web) distributed approximately evenly over the
tree classes. We withheld about 10% of these images as a test set, and cropped
the remaining images around the tree exemplars to produce our training set.

We first evaluated a fully automatic recognition approach. Using only GIST
features and a nearest-neighbor classifier under Euclidean distance, we
achieved a classification accuracy of 15.8%, relative to a 5% majority-class
baseline. After learning a new distance metric using only offline training, the
fully automatic accuracy increased to 26.3%. We then evaluated the
human-in-the-loop technique using a simple GUI and a non-expert human user. The
user was asked to interact with the system, iteratively selecting visually
similar images (which the system used to update the distance metric) until he
or she believed that most of the candidates were visually similar to the target
image, at which point the system assigned the majority label. The user attained
an accuracy of 36.8%, or over seven times baseline, on this challenging
fine-grained categorization task.
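The comparison against the baseline is simple arithmetic, reproduced below as a small check. It assumes the 5% baseline corresponds to picking one of 20 roughly balanced classes; the dictionary keys are descriptive labels chosen here, not names from the evaluation itself.

```python
# Reported accuracies from the evaluation, expressed as fractions.
reported = {
    "Euclidean nearest neighbor": 0.158,
    "learned metric (offline only)": 0.263,
    "human in the loop": 0.368,
}

num_classes = 20
baseline = 1.0 / num_classes  # 5% for 20 roughly balanced classes (assumption)

# Ratio of each reported accuracy to the baseline.
ratios = {name: acc / baseline for name, acc in reported.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the {baseline:.0%} baseline")
```

The human-in-the-loop ratio comes out at about 7.36, consistent with the "over seven times baseline" statement.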

4.  Conclusion 

Our preliminary results demonstrate the potential of a human-in-the-loop
approach to solve a challenging tree recognition problem that would be
difficult or impossible for computers or humans to solve individually. This is
ongoing work, and we are continuing to explore a variety of directions,
including using more sophisticated visual features, injecting diversity into
the sets of candidates, and studying other fine-grained classification tasks.